Web Classification Approach Using Reduced Vector Representation Model Based on HTML Tags
Abstract
Automatic web page classification plays an essential role in information retrieval, web mining, and web semantics applications. Web pages have special characteristics (such as HTML tags and hyperlinks) that make their classification different from standard text categorization. As a result, traditional text classifiers usually do not produce good results when applied to web data. In this paper, we propose an approach that categorizes web pages by exploiting both plain text and the text contained in HTML tags. Our method operates in two steps. In Step 1, we use a Support Vector Machine (SVM) classifier to generate, for each target web page (the page to classify), a reduced vector representation based on plain text and text from HTML tags. In Step 2, we submit this vector representation to a Naive Bayes (NB) classifier to determine the final class of the target page. We conducted our experiments on two large datasets of pages from ODP (Open Directory Project) and WebKB (Web Knowledge Base), which we found, incidentally, to be missing many HTML tags. The results show that the NB classifier, supported by our model and using HTML tag content combined with plain text, (1) performs significantly better than the NB classifier using text alone in terms of both Micro-F1 and Macro-F1, even when HTML tags are missing, (2) performs consistently with respect to category distribution, and (3) outperforms the NB classifier using text alone even when only very basic techniques are used to handle missing HTML tags.
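As a rough illustration of the input side of this approach, the sketch below separates a page into plain text and per-tag text using only the Python standard library. It is not the authors' implementation: the tracked tags (`title`, `h1`, `a`), the names `TagTextExtractor` and `extract_fields`, and the empty-string fallback for missing tags are all assumptions made for this example, since the abstract does not say which tags are used or how missing tags are handled.

```python
from html.parser import HTMLParser

# Hypothetical choice of tags; the abstract does not list the tags actually used.
TAG_FIELDS = ("title", "h1", "a")


class TagTextExtractor(HTMLParser):
    """Collects visible text, both overall and per tracked tag."""

    def __init__(self):
        super().__init__()
        self.fields = {tag: [] for tag in TAG_FIELDS}
        self.plain = []
        self._depth = {tag: 0 for tag in TAG_FIELDS}  # open-tag counters
        self._skip = 0  # nesting depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag in self._depth:
            self._depth[tag] += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = max(0, self._skip - 1)
        elif tag in self._depth:
            self._depth[tag] = max(0, self._depth[tag] - 1)

    def handle_data(self, data):
        if self._skip or not data.strip():
            return
        # Text belongs to every tracked tag that is currently open.
        for tag, depth in self._depth.items():
            if depth > 0:
                self.fields[tag].append(data)
        # Everything outside <title> counts as plain (body) text here.
        if self._depth.get("title", 0) == 0:
            self.plain.append(data)


def extract_fields(html, fallback=""):
    """Return {'plain': ..., 'title': ..., 'h1': ..., 'a': ...}.

    A tag that is missing from the page yields `fallback`; this is an
    assumed, deliberately basic way of handling missing tags.
    """
    parser = TagTextExtractor()
    parser.feed(html)
    parser.close()

    def normalize(chunks):
        return " ".join(" ".join(chunks).split())

    out = {"plain": normalize(parser.plain)}
    for tag, chunks in parser.fields.items():
        out[tag] = normalize(chunks) or fallback
    return out
```

Calling `extract_fields(page)` on a page's HTML string returns one text field per source. In a pipeline like the one described above, each field could then be vectorized separately before the SVM step, although that wiring is likewise an assumption rather than something the abstract specifies.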
Similar Resources
Model-Based Classification of Web Documents Represented by Graphs
Most web content classification methods are based on the vector-space model of information retrieval. One of the important advantages of this representation model is that it can be used by both instance-based and model-based classifiers for categorization. However, this popular method of document representation does not capture important structural information, such as the order and proximity of...
The hybrid representation model for web document classification
Most web content categorization methods are based on the vector-space model of information retrieval. One of the most important advantages of this representation model is that it can be used by both instance-based and model-based classifiers. However, this popular method of document representation does not capture important structural information, such as the order and proximity of word occurre...
Web Page Structure Enhanced Feature Selection for Classification of Web Pages
Web page classification is achieved using text classification techniques. Web page classification is different from traditional text classification due to additional information, provided by web page structure which provides much information on content importance. HTML tags provide visual web page representation and can be considered a parameter to highlight content importance. Textual keywords...
A Joint Semantic Vector Representation Model for Text Clustering and Classification
Text clustering and classification are two main tasks of text mining. Feature selection plays a key role in the quality of clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing semantic concepts of text motivated researchers to use...
Web Documents Categorization using Fuzzy Representation and HAC
Most of the existing techniques for characterization of Web documents are based on term-frequency analysis. In such models, given a set of documents, the characterization of each document is represented by a feature vector in a vector space. However, as Web documents written in HTML are semi-structured documents by means of tags, the traditional techniques that assign term weights only by the ...